P0-7: instant-api terminationGracePeriodSeconds=35 + preStop (MR-P0-7) by mastermanas805 · Pull Request #15 · InstaNode-dev/infra

mastermanas805 · 2026-05-20T11:01:40Z

Diagnosis

The api Deployment in k8s/app.yaml was missing two pieces needed for graceful shutdown:

terminationGracePeriodSeconds: unset, so it defaulted to 30s. That is shorter than the api's drain budget of preStop sleep 5 + readinessDrainGrace 3 + ShutdownWithTimeout 25 = 33s of shutdown work, so the kubelet was SIGKILLing the container mid-drain.
preStop lifecycle hook: without it, the kubelet sends SIGTERM immediately on pod termination. The LB doesn't refresh Service endpoints until the readinessProbe fails on the next tick, so new traffic kept landing on a pod that was about to stop accepting connections.

Diff Summary

k8s/app.yaml:

New terminationGracePeriodSeconds: 35 on the api pod spec (budget: preStop 5s + readinessDrainGrace 3s + shutdownTimeout 25s + safety 2s = 35s).
New lifecycle.preStop.exec.command: ["/bin/sh", "-c", "sleep 5"] on the api container. This gives the kubelet a window to observe the api's /readyz 503 flip (via hooks.Readyz.MarkDraining in the api repo's companion PR) and update Service endpoints before SIGTERM is delivered.
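
For orientation, the two additions could sit in the Deployment roughly as follows. This is a sketch, not the actual contents of k8s/app.yaml: the Deployment name and namespace are taken from the verify plan, while the labels, container name, image, probe port, and probe period are assumptions made only so the fragment is complete.

```yaml
# Sketch only: labels, container name, image, and probe settings are assumed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: instant-api
  namespace: instant
spec:
  selector:
    matchLabels:
      app: instant-api            # assumed label
  template:
    metadata:
      labels:
        app: instant-api          # assumed label
    spec:
      # Budget: preStop 5s + readinessDrainGrace 3s + shutdown timeout 25s + safety 2s = 35s
      terminationGracePeriodSeconds: 35
      containers:
        - name: api               # assumed container name
          image: example.invalid/instant-api:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                # Runs to completion before the kubelet sends SIGTERM
                command: ["/bin/sh", "-c", "sleep 5"]
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080          # assumed port
            periodSeconds: 1      # assumed probe period
```

The preStop sleep counts against terminationGracePeriodSeconds, which is why the 5s appears in the budget rather than on top of it.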

Required Companion PR

api repo, branch ship/p0-7-graceful-shutdown-readiness-2026-05-20: adds MarkDraining to /readyz and wires hooks.Readyz.MarkDraining() into the SIGTERM handler. The two PRs must land together: this manifest change alone widens the grace period but doesn't flip /readyz to 503, so the LB still routes new traffic to a draining pod.

Live Verify Plan (post-merge)

kubectl apply -f k8s/app.yaml (or whichever path is canonical for infra).
kubectl rollout restart deploy/instant-api -n instant
kubectl describe pod, run mid-roll, shows preStop running, then the readiness probe failing, then container exit.
kubectl get events -n instant --sort-by='.lastTimestamp' | tail -20 shows no 'FailedKillPod' / 'killed before terminationGracePeriod' events.

🤖 Generated with Claude Code

Three new Prometheus alerts tied to the worker repo's PASS 3 enhanced reasons + PASS 6 stuck-build counters:

- OrphanSweepNoDBRowReap (CRITICAL, 1h): a k8s namespace had no backing deployments row, the P0-3 atomic-provision symptom. Pages on first occurrence over 1h.
- OrphanSweepStuckBuildSpike (WARNING, 15m): >5 stuck-build flips in 15m means the kaniko/GHCR build pipeline is degraded for many customers at once.
- OrphanSweepReapFailureRate (WARNING, 30m): the reconciler detected orphans it cannot reap (k8s/DB write failure sustained).

The counters land in worker master commit 7d2ff0d; the alerts go live once the deploy lands + scrape picks them up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
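
As an illustration of the shape such a rule might take, here is a sketch of the stuck-build alert in Prometheus rule-file format. The alert name, severity, threshold, and 15m window come from the commit message; the group name and the counter name `orphan_sweep_stuck_build_flips_total` are hypothetical, since the real metric names live in the worker repo and are not shown here.

```yaml
# Sketch only: group name and metric name are hypothetical placeholders.
groups:
  - name: orphan-sweep
    rules:
      - alert: OrphanSweepStuckBuildSpike
        # Fires when more than 5 stuck-build flips were counted over the last 15 minutes.
        expr: sum(increase(orphan_sweep_stuck_build_flips_total[15m])) > 5
        labels:
          severity: warning
        annotations:
          summary: "More than 5 stuck-build flips in 15m; the build pipeline may be degraded"
```

Whether the 15m in the commit message is the `increase()` window, a `for:` duration, or both is not stated; this sketch treats it as the window.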

mastermanas805 and others added 2 commits May 20, 2026 16:26

feat(k8s/app.yaml): terminationGracePeriodSeconds=35 + preStop hook (…

318c5b3

…MR-P0-7) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mastermanas805 merged commit 7ad904e into master May 20, 2026
1 check passed

mastermanas805 merged 2 commits into master from ship/p0-7-api-grace-period-2026-05-20
